CS-194-26: Project 4, Parts 1-3

Abhijay Bhatnagar

Part 1: Nose Tip Detection

Dataloader

We will begin by loading the imm_face_db and processing the images and annotations. The images are converted to grayscale and normalized to float values, and the data is split into a training set and a validation set as described in the spec.

CNN

From here, we are going to use our loaded data to train a CNN to detect the nose tips of the individuals. To begin with, we will create the following network structure:

Net(
  (conv1): Conv2d(1, 6, kernel_size=(3, 3), stride=(1, 1))
  (conv2): Conv2d(6, 16, kernel_size=(3, 3), stride=(1, 1))
  (conv3): Conv2d(16, 25, kernel_size=(3, 3), stride=(1, 1))
  (fc1): Linear(in_features=1000, out_features=20, bias=True)
  (fc2): Linear(in_features=20, out_features=2, bias=True)
)
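The printout above only lists the layers, so the forward pass below is a sketch based on assumptions: grayscale inputs resized to 80x60 (which makes the printed fc1 size work out to 25 channels x 5 x 8 = 1000), a ReLU after each layer, and a 2x2 max pool after each conv.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Net(nn.Module):
    """Nose-tip regressor; assumes 1 x 60 x 80 grayscale input."""
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 6, 3)
        self.conv2 = nn.Conv2d(6, 16, 3)
        self.conv3 = nn.Conv2d(16, 25, 3)
        self.fc1 = nn.Linear(1000, 20)  # 25 * 5 * 8 = 1000 after three pools
        self.fc2 = nn.Linear(20, 2)     # (x, y) of the nose tip

    def forward(self, x):
        x = F.max_pool2d(F.relu(self.conv1(x)), 2)  # 60x80 -> 29x39
        x = F.max_pool2d(F.relu(self.conv2(x)), 2)  # -> 13x18
        x = F.max_pool2d(F.relu(self.conv3(x)), 2)  # -> 5x8
        x = x.flatten(1)
        x = F.relu(self.fc1(x))
        return self.fc2(x)

out = Net()(torch.zeros(1, 1, 60, 80))
print(out.shape)  # torch.Size([1, 2])
```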

Loss

After training this network on the training data for 25 epochs, we plot the training and validation loss per epoch.
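The training loop itself is not shown in the report; a rough sketch is below. The optimizer choice, the MSE loss on the landmark coordinates, and the random tensors standing in for the real dataloaders are all assumptions for illustration.

```python
import torch
import torch.nn as nn

# Hypothetical stand-ins for the real model and data, so the sketch runs alone.
model = nn.Sequential(nn.Flatten(), nn.Linear(60 * 80, 2))
criterion = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

train_x, train_y = torch.rand(32, 1, 60, 80), torch.rand(32, 2)
val_x, val_y = torch.rand(8, 1, 60, 80), torch.rand(8, 2)

train_losses, val_losses = [], []
for epoch in range(25):
    model.train()
    optimizer.zero_grad()
    loss = criterion(model(train_x), train_y)
    loss.backward()
    optimizer.step()
    train_losses.append(loss.item())  # one recorded point per epoch

    model.eval()
    with torch.no_grad():
        val_losses.append(criterion(model(val_x), val_y).item())
```

The two recorded lists are what get plotted as the per-epoch training and validation curves.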

For reference, here are the best results from that batch, with the predictions plotted in blue, and the true landmarks in red.

As well as the two worst results...

For both of these cases, it looks like the network incorrectly predicted which orientation the face was in, and more specifically, which shadow or contour corresponds to the nose. It's a low-capacity network, so it is likely just underfitting.

Part 2: Full Facial Keypoints Detection

Now we will attempt the same task over the entire keypoint space. We will also augment the data in order to reduce overfitting.

Here are a few images from the sample; notice the varying brightness and contrast.
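A brightness/contrast jitter like the one applied here can be sketched in pure PyTorch; the jitter ranges below are assumptions. One convenient property of photometric-only augmentation is that the landmark coordinates need no corresponding transform.

```python
import torch

def jitter(img, brightness=0.2, contrast=0.2):
    """Randomly adjust brightness and contrast of a [0, 1] grayscale tensor.
    Photometric-only, so landmark coordinates stay unchanged."""
    b = 1.0 + (torch.rand(1).item() * 2 - 1) * brightness
    c = 1.0 + (torch.rand(1).item() * 2 - 1) * contrast
    mean = img.mean()
    out = (img - mean) * c + mean  # scale contrast about the image mean
    return (out * b).clamp(0.0, 1.0)

img = torch.rand(1, 120, 160)
aug = jitter(img)
print(aug.shape)  # same shape as the input image
```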

And here is the complete architecture of my network.

FullNet(
  (conv1): Conv2d(1, 4, kernel_size=(3, 3), stride=(1, 1))
  (conv2): Conv2d(4, 16, kernel_size=(3, 3), stride=(1, 1))
  (conv3): Conv2d(16, 32, kernel_size=(3, 3), stride=(1, 1))
  (conv4): Conv2d(32, 64, kernel_size=(3, 3), stride=(1, 1))
  (conv5): Conv2d(64, 128, kernel_size=(3, 3), stride=(1, 1))
  (conv6): Conv2d(128, 256, kernel_size=(3, 3), stride=(1, 1))
  (fc1): Linear(in_features=1024, out_features=512, bias=True)
  (fc2): Linear(in_features=512, out_features=116, bias=True)
)

I chose my hyperparameters mostly by trial. I used 6 conv layers as recommended in the spec, each followed by a ReLU, with max-pooling after the first few layers. A learning rate of 0.001 was sufficient to reach a good loss.

Loss

I trained this network on the first 80% of the dataset as training, and here is the loss of training and validation.

Here are a couple of the best predictions...

As well as the worst...

It's pretty clear from looking at some of the worst-performing results that the network handles turned faces poorly. It is likely overfitting to the standard frontal orientation.

Visualizing the Filters


Part 3

Dataloader

We will begin by utilizing the staff code to extract the image filenames, landmarks, and bounding boxes. Adjustments are made to the bounding box to ensure the resized and cropped version of the photo contains all landmarks. Afterwards, we will load the entire set of cropped images into memory for faster processing. A few raw images, and their processed bounding boxes and landmarks are shown below.
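The bounding-box adjustment can be sketched as follows. The (x, y, w, h) tuple layout, the landmark array shape, and the padding amount are assumptions, since the staff format isn't reproduced here.

```python
import numpy as np

def expand_bbox(bbox, landmarks, pad=10):
    """Grow an (x, y, w, h) box so every (x, y) landmark falls inside,
    plus a little padding on each side."""
    x, y, w, h = bbox
    x0 = min(x, landmarks[:, 0].min()) - pad
    y0 = min(y, landmarks[:, 1].min()) - pad
    x1 = max(x + w, landmarks[:, 0].max()) + pad
    y1 = max(y + h, landmarks[:, 1].max()) + pad
    return x0, y0, x1 - x0, y1 - y0

# A landmark at (30, 40) lies outside the original 10x10 box at (5, 5),
# so the box grows to cover it (plus padding).
print(expand_bbox((5, 5, 10, 10), np.array([[0.0, 0.0], [30.0, 40.0]])))
```

Cropping with the expanded box then guarantees no landmark is lost in the resize.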

In addition to the original images, we also augment the data with contrast- and brightness-adjusted copies. Below are augmented versions of a few of the previously shown samples.

Successfully Augmented

Part 2: CNN

Here I used the pretrained ResNet-18 model, trained on my entire dataset. In the following diagrams, I plot the loss, as well as the ground-truth and predicted points for a set of faces in the dataset.

For my hyperparameters, I made only the suggested modifications to the first and last layers, and picked my learning rate of $0.001$ experimentally.

Exact Model Architecture:

Model's state_dict:
conv1.weight 	 torch.Size([64, 1, 7, 7])
bn1.weight 	 torch.Size([64])
bn1.bias 	 torch.Size([64])
bn1.running_mean 	 torch.Size([64])
bn1.running_var 	 torch.Size([64])
bn1.num_batches_tracked 	 torch.Size([])
layer1.0.conv1.weight 	 torch.Size([64, 64, 3, 3])
layer1.0.bn1.weight 	 torch.Size([64])
layer1.0.bn1.bias 	 torch.Size([64])
layer1.0.bn1.running_mean 	 torch.Size([64])
layer1.0.bn1.running_var 	 torch.Size([64])
layer1.0.bn1.num_batches_tracked 	 torch.Size([])
layer1.0.conv2.weight 	 torch.Size([64, 64, 3, 3])
layer1.0.bn2.weight 	 torch.Size([64])
layer1.0.bn2.bias 	 torch.Size([64])
layer1.0.bn2.running_mean 	 torch.Size([64])
layer1.0.bn2.running_var 	 torch.Size([64])
layer1.0.bn2.num_batches_tracked 	 torch.Size([])
layer1.1.conv1.weight 	 torch.Size([64, 64, 3, 3])
layer1.1.bn1.weight 	 torch.Size([64])
layer1.1.bn1.bias 	 torch.Size([64])
layer1.1.bn1.running_mean 	 torch.Size([64])
layer1.1.bn1.running_var 	 torch.Size([64])
layer1.1.bn1.num_batches_tracked 	 torch.Size([])
layer1.1.conv2.weight 	 torch.Size([64, 64, 3, 3])
layer1.1.bn2.weight 	 torch.Size([64])
layer1.1.bn2.bias 	 torch.Size([64])
layer1.1.bn2.running_mean 	 torch.Size([64])
layer1.1.bn2.running_var 	 torch.Size([64])
layer1.1.bn2.num_batches_tracked 	 torch.Size([])
layer2.0.conv1.weight 	 torch.Size([128, 64, 3, 3])
layer2.0.bn1.weight 	 torch.Size([128])
layer2.0.bn1.bias 	 torch.Size([128])
layer2.0.bn1.running_mean 	 torch.Size([128])
layer2.0.bn1.running_var 	 torch.Size([128])
layer2.0.bn1.num_batches_tracked 	 torch.Size([])
layer2.0.conv2.weight 	 torch.Size([128, 128, 3, 3])
layer2.0.bn2.weight 	 torch.Size([128])
layer2.0.bn2.bias 	 torch.Size([128])
layer2.0.bn2.running_mean 	 torch.Size([128])
layer2.0.bn2.running_var 	 torch.Size([128])
layer2.0.bn2.num_batches_tracked 	 torch.Size([])
layer2.0.downsample.0.weight 	 torch.Size([128, 64, 1, 1])
layer2.0.downsample.1.weight 	 torch.Size([128])
layer2.0.downsample.1.bias 	 torch.Size([128])
layer2.0.downsample.1.running_mean 	 torch.Size([128])
layer2.0.downsample.1.running_var 	 torch.Size([128])
layer2.0.downsample.1.num_batches_tracked 	 torch.Size([])
layer2.1.conv1.weight 	 torch.Size([128, 128, 3, 3])
layer2.1.bn1.weight 	 torch.Size([128])
layer2.1.bn1.bias 	 torch.Size([128])
layer2.1.bn1.running_mean 	 torch.Size([128])
layer2.1.bn1.running_var 	 torch.Size([128])
layer2.1.bn1.num_batches_tracked 	 torch.Size([])
layer2.1.conv2.weight 	 torch.Size([128, 128, 3, 3])
layer2.1.bn2.weight 	 torch.Size([128])
layer2.1.bn2.bias 	 torch.Size([128])
layer2.1.bn2.running_mean 	 torch.Size([128])
layer2.1.bn2.running_var 	 torch.Size([128])
layer2.1.bn2.num_batches_tracked 	 torch.Size([])
layer3.0.conv1.weight 	 torch.Size([256, 128, 3, 3])
layer3.0.bn1.weight 	 torch.Size([256])
layer3.0.bn1.bias 	 torch.Size([256])
layer3.0.bn1.running_mean 	 torch.Size([256])
layer3.0.bn1.running_var 	 torch.Size([256])
layer3.0.bn1.num_batches_tracked 	 torch.Size([])
layer3.0.conv2.weight 	 torch.Size([256, 256, 3, 3])
layer3.0.bn2.weight 	 torch.Size([256])
layer3.0.bn2.bias 	 torch.Size([256])
layer3.0.bn2.running_mean 	 torch.Size([256])
layer3.0.bn2.running_var 	 torch.Size([256])
layer3.0.bn2.num_batches_tracked 	 torch.Size([])
layer3.0.downsample.0.weight 	 torch.Size([256, 128, 1, 1])
layer3.0.downsample.1.weight 	 torch.Size([256])
layer3.0.downsample.1.bias 	 torch.Size([256])
layer3.0.downsample.1.running_mean 	 torch.Size([256])
layer3.0.downsample.1.running_var 	 torch.Size([256])
layer3.0.downsample.1.num_batches_tracked 	 torch.Size([])
layer3.1.conv1.weight 	 torch.Size([256, 256, 3, 3])
layer3.1.bn1.weight 	 torch.Size([256])
layer3.1.bn1.bias 	 torch.Size([256])
layer3.1.bn1.running_mean 	 torch.Size([256])
layer3.1.bn1.running_var 	 torch.Size([256])
layer3.1.bn1.num_batches_tracked 	 torch.Size([])
layer3.1.conv2.weight 	 torch.Size([256, 256, 3, 3])
layer3.1.bn2.weight 	 torch.Size([256])
layer3.1.bn2.bias 	 torch.Size([256])
layer3.1.bn2.running_mean 	 torch.Size([256])
layer3.1.bn2.running_var 	 torch.Size([256])
layer3.1.bn2.num_batches_tracked 	 torch.Size([])
layer4.0.conv1.weight 	 torch.Size([512, 256, 3, 3])
layer4.0.bn1.weight 	 torch.Size([512])
layer4.0.bn1.bias 	 torch.Size([512])
layer4.0.bn1.running_mean 	 torch.Size([512])
layer4.0.bn1.running_var 	 torch.Size([512])
layer4.0.bn1.num_batches_tracked 	 torch.Size([])
layer4.0.conv2.weight 	 torch.Size([512, 512, 3, 3])
layer4.0.bn2.weight 	 torch.Size([512])
layer4.0.bn2.bias 	 torch.Size([512])
layer4.0.bn2.running_mean 	 torch.Size([512])
layer4.0.bn2.running_var 	 torch.Size([512])
layer4.0.bn2.num_batches_tracked 	 torch.Size([])
layer4.0.downsample.0.weight 	 torch.Size([512, 256, 1, 1])
layer4.0.downsample.1.weight 	 torch.Size([512])
layer4.0.downsample.1.bias 	 torch.Size([512])
layer4.0.downsample.1.running_mean 	 torch.Size([512])
layer4.0.downsample.1.running_var 	 torch.Size([512])
layer4.0.downsample.1.num_batches_tracked 	 torch.Size([])
layer4.1.conv1.weight 	 torch.Size([512, 512, 3, 3])
layer4.1.bn1.weight 	 torch.Size([512])
layer4.1.bn1.bias 	 torch.Size([512])
layer4.1.bn1.running_mean 	 torch.Size([512])
layer4.1.bn1.running_var 	 torch.Size([512])
layer4.1.bn1.num_batches_tracked 	 torch.Size([])
layer4.1.conv2.weight 	 torch.Size([512, 512, 3, 3])
layer4.1.bn2.weight 	 torch.Size([512])
layer4.1.bn2.bias 	 torch.Size([512])
layer4.1.bn2.running_mean 	 torch.Size([512])
layer4.1.bn2.running_var 	 torch.Size([512])
layer4.1.bn2.num_batches_tracked 	 torch.Size([])
fc.weight 	 torch.Size([136, 512])
fc.bias 	 torch.Size([136])

And here is a graph of the Training and Validation loss over time.

For reference, I've included several images from the dataset with their original keypoints (in red) and the CNN's predicted keypoints (in cyan).

Kaggle Test Data

We will now run our model on the Kaggle test data. The predictions for several faces are shown below. Some turn out quite accurate, particularly for evenly lit, straight-on faces.

For the ones that are off, the model especially struggles with rotation and with faces occluded by objects or hands. It also seems biased toward a smile, missing the details of frowns.

Submitted to Kaggle under the username AbhijayBerkeley. I received a score of 22.55688.

3 Photos Test

Here I've included the net's results on 3 random photos. It gets all three mostly right, but you can see that, especially for the last one, it struggles to position the eyes.